Skip to content

blobstore: support parallel download option and createParentPath option in AWS and GCP#377

Open
LihaoLiuXs wants to merge 17 commits intosalesforce:mainfrom
LihaoLiuXs:parallelDownload
Open

blobstore: support parallel download option and createParentPath option in AWS and GCP#377
LihaoLiuXs wants to merge 17 commits intosalesforce:mainfrom
LihaoLiuXs:parallelDownload

Conversation

@LihaoLiuXs
Copy link
Copy Markdown
Collaborator

@LihaoLiuXs LihaoLiuXs commented Apr 8, 2026

Summary

Moved the parallelDownd option to asyncClient. Since we don't have the conformance test for async client. I tested it locally with example.

GCP test result:
BUCKET_NAME=palsfdc mvn -q exec:java -Dexec.mainClass=com.salesforce.multicloudj.blob.Main -Dbucket.name=palsfdc 2>&1
=== STARTING PARALLEL DOWNLOAD TESTS ===
Provider: gcp
=== Testing Parallel Download (Path) ===
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J(W): Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J(W): Ignoring binding found at [jar:file:/Users/barry.liu/.m2/repository/org/slf4j/slf4j-simple/1.7.2/slf4j-simple-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J(W): See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
[parallelDownloadToPath] Uploading test blob: key=parallel-download-path-1776870394493.bin, size=16777216
[parallelDownloadToPath] key=parallel-download-path-1776870394493.bin, bytes=16777216, elapsedMs=1721, etag=CNLan4begZQDEAE=
[parallelDownloadToPath] SUCCESS
=== Testing Parallel Download (createParentPath) ===
[parallelDownloadWithCreateParentPath] Uploading nested-key blob: key=parallel/nested/dir/blob-1776870398374.bin
[parallelDownloadWithCreateParentPath] resolvedPath=/tmp/parallel-download-root-1776870398374/parallel/nested/dir/blob-1776870398374.bin, bytes=8388608, etag=COLJzofegZQDEAE=
[parallelDownloadWithCreateParentPath] SUCCESS
=== Testing Parallel Download (with Range) ===
[parallelDownloadWithRange] key=parallel-download-range-1776870400563.bin, bytes=3073, expected=3073
[parallelDownloadWithRange] SUCCESS
=== Testing Parallel vs Sequential Download ===
[parallelVsSequentialDownload] sequentialMs=2349, parallelMs=2422, sizeBytes=33554432, seqDownloadedBytes=33554432, parDownloadedBytes=33554432
[parallelVsSequentialDownload] SUCCESS
Parallel download tests completed!
Screenshot 2026-04-22 at 11 51 33 AM

AWS test result:

AWS_ACCESS_KEY_ID=ASIAZQ3DQ6RHW2RLTYBV AWS_SECRET_ACCESS_KEY=ptpSaDk3289DvAAa1a2nWydKWorUHOCv6C5EZirZ AWS_SESSION_TOKEN=IQoJb3JpZ2luX2VjEIj//////////wEaCXVzLWVhc3QtMSJHMEUCIQDkYobo2adHLDVDa+nGaRy4fMqPI+Kn5JXDOQuNqtJbCwIgGG0YDYe4mssxmxR0SXi05yynyLm9iMOoqCIgmBgBYz8qoQMIURAAGgw2NTQ2NTQzNzA4OTUiDDehr5JzqAAf6rhQNir+AhcU5dA7Rvp02CwPTI5f2u6StBC4zzKXwhmfHSFvhDr94unO6627FuL2s0TZTqoTWzCiMPeUlDbQsAECdnbMx+2WUyNNcMx3arePqqaH2WBTY8CnSTLEYIZ/JKUrvs9oSCC8gw79hW37tlQVN6fJWlHu4LnXQcECnt6NUCGH0TC4P3udxZs52GZ7cNxyxhhT8XULEuaz2iGPmDbVNW/aLT8ACAxZqnuYS0P0s/Fu7RWdOI2JW2Yjavt6+rq/ndDMzZpBKXTfGxCV0i9XPIpLhRA2VXAI9gZ5vqNSXVPpwhSFRViATPvppkHFOZ0j2K0ZnqOzo0ZgwTzcbPemo1e1retBNX0p5o7q6CIpwdcF549NmVvAGmvrPRQuYWEJUnMKRvTkNgPR0B6qCT4Nnr0zm4JJwXYpgPtns9gzfIngj/mjFxIK2reP92G/YKh/DDGeFt21sK93SyZmDk50qu+58TKwrNcMG88O9+hkltvhySc+o8b8Gu5cpbyCTZiggLIwy92jzwY63gFbDq00OpQM0b4/EsIVRKQLF6ONqU+QYqaRX1I9n/4TgKLz1/vYVOwb34450QreE8PakIhIQlnfPv2JNlQh8SzLmOGSF8BHX1+3rh+pEU9rPTo3mjJXts18LMtnFwP/ExOJ/A9dgWdQzRgX6GN2akSgC5pdnpoKZaGMVm5Ft7fU96EGl3D1cUiwcmrGHBHCbGeeND54tMoA+Xx/XhO7uZjdwlHHjMKgJuZmmOp76Nd/IFdBZ7Gfw9kfXfZdinVS8ilxzOJFzhvM0/jBIEL1fDFxmUgXKSUEaa0pEAiXOqg= AWS_REGION=us-west-2 BUCKET_NAME=example-test-bucket-barry BUCKET_REGION=us-west-2 mvn -q exec:java -Dexec.mainClass=com.salesforce.multicloudj.blob.Main -Dprovider=aws -Dbucket.name=example-test-bucket-barry -Dbucket.region=us-west-2 2>&1 | grep -vE '^\s+at |Caused by:' | tail -60
=== STARTING PARALLEL DOWNLOAD TESTS ===
Provider: aws
=== Testing Parallel Download (Path) ===
SLF4J(W): No SLF4J providers were found.
SLF4J(W): Defaulting to no-operation (NOP) logger implementation
SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J(W): Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J(W): Ignoring binding found at [jar:file:/Users/barry.liu/.m2/repository/org/slf4j/slf4j-simple/1.7.2/slf4j-simple-1.7.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J(W): See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
[parallelDownloadToPath] Uploading test blob: key=parallel-download-path-1776873590056.bin, size=16777216
[parallelDownloadToPath] key=parallel-download-path-1776873590056.bin, bytes=16777216, elapsedMs=2067, etag="e06888832b83d4f381e885edebd1d782"
[parallelDownloadToPath] SUCCESS
[parallelDownloadToPath] remote blob retained for inspection: parallel-download-path-1776873590056.bin
=== Testing Parallel Download (createParentPath) ===
[parallelDownloadWithCreateParentPath] Uploading nested-key blob: key=parallel/nested/dir/blob-1776873595717.bin
[parallelDownloadWithCreateParentPath] resolvedPath=/tmp/parallel-download-root-1776873595717/parallel/nested/dir/blob-1776873595717.bin, bytes=8388608, etag="57b019a28c426df5727b3992701bd2be"
[parallelDownloadWithCreateParentPath] SUCCESS
[parallelDownloadWithCreateParentPath] remote blob retained for inspection: parallel/nested/dir/blob-1776873595717.bin
=== Testing Parallel Download (with Range) ===
[parallelDownloadWithRange] key=parallel-download-range-1776873598786.bin, bytes=3073, expected=3073
[parallelDownloadWithRange] SUCCESS
[parallelDownloadWithRange] remote blob retained for inspection: parallel-download-range-1776873598786.bin
=== Testing Parallel vs Sequential Download ===
[parallelVsSequentialDownload] sequentialMs=3321, parallelMs=3580, sizeBytes=33554432, seqDownloadedBytes=33554432, parDownloadedBytes=33554432
[parallelVsSequentialDownload] SUCCESS
[parallelVsSequentialDownload] remote blob retained for inspection: parallel-vs-sequential-1776873600509.bin
Parallel download tests completed!

Screenshot 2026-04-22 at 12 05 13 PM

Some conventions to follow

  1. add the module name as a prefix
    • for example: add a prefix: docstore: for document store module, blobstore for Blob Store module
  2. for a test only PR, add test:
  3. for a perf improvement only PR, add perf:
  4. for a refactoring only PR, add "refactor:"

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 37.66234% with 96 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.49%. Comparing base (d8556bc) to head (26ce906).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
.../salesforce/multicloudj/blob/gcp/GcpBlobStore.java 26.80% 62 Missing and 9 partials ⚠️
...oudj/blob/async/driver/AbstractAsyncBlobStore.java 10.00% 8 Missing and 1 partial ⚠️
.../multicloudj/blob/aws/async/AwsAsyncBlobStore.java 33.33% 5 Missing and 1 partial ⚠️
...alesforce/multicloudj/blob/aws/AwsTransformer.java 0.00% 4 Missing ⚠️
...rce/multicloudj/blob/driver/AbstractBlobStore.java 70.00% 2 Missing and 1 partial ⚠️
...e/multicloudj/blob/inmemory/InMemoryBlobStore.java 75.00% 2 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (37.66%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #377      +/-   ##
============================================
- Coverage     82.10%   81.49%   -0.62%     
- Complexity      641      643       +2     
============================================
  Files           194      194              
  Lines         11798    11994     +196     
  Branches       1572     1605      +33     
============================================
+ Hits           9687     9774      +87     
- Misses         1434     1530      +96     
- Partials        677      690      +13     
Flag Coverage Δ
unittests 81.49% <37.66%> (-0.62%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@iamabhilaksh
Copy link
Copy Markdown
Contributor

iamabhilaksh commented Apr 9, 2026

This looks solid. Could you share the testing logs or a brief summary of how this was verified on AWS/GCP? Especially interested in seeing the parallel download and createParentPath behavior in action. 👍

import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.http.apache.ProxyConfiguration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3AsyncClient;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a bit odd - we shouldn't mix sync and async in here. AwsBlobStore in meant for sync operations

Copy link
Copy Markdown
Collaborator Author

@LihaoLiuXs LihaoLiuXs Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move the parallel download option to asyncClient.

Comment on lines +90 to +91
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.DownloadFileRequest;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Copy Markdown
Contributor

@sandeepvinayak sandeepvinayak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the client wants to do parallel downloads - they need to use async client. We shouldn't mix these two - it will get into the high maintainance sphagetti code.

@LihaoLiuXs
Copy link
Copy Markdown
Collaborator Author

if the client wants to do parallel downloads - they need to use async client. We shouldn't mix these two - it will get into the high maintainance sphagetti code.

Agreed. I moved the parallel download option to asyncClient.

* When createParentPath is enabled, resolves the final file path by appending
* the object key to the destination root and creating any missing parent directories.
*/
private Path resolveDownloadDestinationPath(DownloadRequest request, Path destination) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it;s not resolving, it's creating.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the name.

Path destinationPath = resolveDownloadDestinationPath(request, path);
GetObjectRequest getObjectRequest = transformer.toRequest(request);
if (request.isParallelDownload()) {
DownloadFileRequest downloadFileRequest =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should move it to transformer

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to transformer.

}

/**
* (Optional) If true, provider implementations may preserve the key's parent path structure in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it creates paths and we should explicitly mention it

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the comment for this


/**
* Worker count for {@link TransferManager} divide-and-conquer downloads, analogous to tuning
* gsutil sliced / parallel Range GET downloads for large objects (see Google Cloud Performance
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove any unlrelated details,we have nothing to do gsutil

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the unrelated comments

* gsutil sliced / parallel Range GET downloads for large objects (see Google Cloud Performance
* Atlas: Optimizing your Cloud Storage performance).
*/
private static final int PARALLEL_LARGE_FILE_DOWNLOAD_MAX_WORKERS = 8;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we chose 8 ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 workers will give us 8 workers × 16 MiB chunks ≈ 800 MB/s aggregate ≈ 6.4 Gbps, which saturates the GCE Compute Engine default per-VM egress cap (7 Gbps).

  1. For machine series that support multiple physical NICs, such as A3, A4, A4X, and A4X Max instances, 7 Gbps per NIC: https://docs.cloud.google.com/compute/docs/network-bandwidth#egress_outside_vpc

  2. Number of work options: 4, 8, 16, 22, 32, 48. transfer_manager sliced download is slow on cloud run googleapis/python-storage#1093

*/
private static final int PARALLEL_LARGE_FILE_DOWNLOAD_MAX_WORKERS = 8;

private static final String PARALLEL_STRIP_PREFIX_TO_BASENAME = "^.*/";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the var name and value doesn't align

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value matches everything up to and including the last '/' in an object key. For example for object key data/2026/04/report.csv, "^.*/" will match the data/2026/04/, and as result we will only keep the report.csv. I will rename this var to OBJECT_KEY_DIRECTORY_PREFIX_REGEX.

Comment on lines +234 to +235
// parallelDownload not supported: delegates to OutputStream which
// cannot accept parallel range-GET writes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove this comment, the docs are added in abstract

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

Comment on lines +297 to +312
private Path resolveDownloadDestinationPath(DownloadRequest request, Path destination) {
if (!request.isCreateParentPath()) {
return destination;
}
// Keep key parent structure under the provided local destination root when requested.
Path resolved = destination.resolve(request.getKey()).normalize();
Path parent = resolved.getParent();
if (parent != null) {
try {
Files.createDirectories(parent);
} catch (IOException e) {
throw new SubstrateSdkException("Failed to create destination directories", e);
}
}
return resolved;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is already common, should go to abstract class

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to AbstractBlobStore and AbstractAsyncBlobStore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants